Abstract

We propose Macau, a powerful and flexible Bayesian factorization method for heterogeneous
data. Our model can factorize any set of entities and relations that can be represented by a
relational model, including tensors and multiple relations for each entity. Macau can also
incorporate side information, specifically entity and relation features, which are crucial for
predicting sparsely observed relations. Macau scales to millions of entity instances, hundreds
of millions of observations, and sparse entity features with millions of dimensions. To achieve
this scale, we designed a dedicated sampling procedure for the entity and relation feature
weights that relies primarily on noise injection in linear regressions. We demonstrate the
performance and advanced features of Macau in a set of experiments, including a challenging
drug-protein activity prediction task.


1 Introduction 

Matrix factorization (MF) has a long history and a wide range of applications in data science,
engineering and many fields of scientific research. While classical approaches, such as SVD,
factorize fully observed matrices, previous work proposed matrix factorization for partially
observed matrices [8]. This enabled the direct use of MF in predictive machine learning problems
(e.g., in collaborative filtering). However, the original formulation [8] can easily overfit the
data, and it was improved in Probabilistic Matrix Factorization (PMF) [4]. The main idea in these
two papers and in subsequent research has been to represent each row and each column by a latent
vector of size D and find the best match to the observed elements of the matrix:

$$\min_{\mathbf{u},\mathbf{v}} \sum_{(i,j) \in I_R} \left(R_{ij} - \mathbf{u}_i^\top \mathbf{v}_j\right)^2 + \lambda_u \|\mathbf{u}\|_F^2 + \lambda_v \|\mathbf{v}\|_F^2, \tag{1}$$

where $\mathbf{u}_i, \mathbf{v}_j \in \mathbb{R}^D$ are the latent vectors for the $i$th row and $j$th column, $I_R$ is the
set of matrix cells whose value has been observed, $R_{ij} \in \mathbb{R}$ are the observed values,
$\|\cdot\|_F$ is the Frobenius norm, and $\lambda_u, \lambda_v > 0$ are regularization parameters. The last two
terms in optimization problem (1), introduced in PMF [4], are derived from zero-mean Gaussian
priors on $\mathbf{u}_i$ and $\mathbf{v}_j$ and a Gaussian noise model on the observed values $R_{ij}$.

* Adam Arany and Jaak Simm contributed equally as first authors.
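As an illustration of objective (1), the following minimal sketch evaluates the regularized squared error on a toy matrix. The function name `pmf_objective`, the dictionary layout for the observed cells and all numbers are assumptions made for this example; they do not come from any existing implementation.

```python
def pmf_objective(observed, U, V, lambda_u, lambda_v):
    """Regularized squared error of objective (1).

    observed: dict mapping (i, j) -> R_ij for the observed cells only.
    U, V: lists of latent vectors (one list of D floats per row / column).
    """
    fit = 0.0
    for (i, j), r_ij in observed.items():
        pred = sum(a * b for a, b in zip(U[i], V[j]))  # u_i^T v_j
        fit += (r_ij - pred) ** 2
    norm_u = sum(a * a for row in U for a in row)  # squared Frobenius norm
    norm_v = sum(b * b for col in V for b in col)
    return fit + lambda_u * norm_u + lambda_v * norm_v


# Toy 2 x 2 matrix with three observed cells and D = 2 (made-up values).
observed = {(0, 0): 5.0, (0, 1): 3.0, (1, 1): 1.0}
U = [[1.0, 2.0], [0.5, 0.0]]
V = [[1.0, 2.0], [2.0, 0.5]]
loss = pmf_objective(observed, U, V, lambda_u=0.1, lambda_v=0.1)
```

In this toy case the latent vectors reproduce all three observed cells exactly, so the loss consists only of the two regularization terms.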


1.1 Bayesian PMF 

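Bayesian PMF (BPMF) [6] extends PMF to full Bayesian inference by introducing common
multivariate Gaussian priors for the latent variables, one for the rows and one for the columns.
To infer these two priors from the data, BPMF places fixed uninformative Normal-Wishart
hyperpriors on them. Let $\mu_u$ and $\Lambda_u$ ($\mu_v$ and $\Lambda_v$) be the mean and precision matrix of the
Gaussian prior for rows (columns); then the model used by BPMF is

$$p(\mathbf{u}, \mu_u, \Lambda_u \mid \Theta_0) = \prod_i \mathcal{N}(\mathbf{u}_i \mid \mu_u, \Lambda_u^{-1}) \, \mathcal{NW}(\mu_u, \Lambda_u \mid \Theta_0) \tag{2}$$

$$p(\mathbf{v}, \mu_v, \Lambda_v \mid \Theta_0) = \prod_j \mathcal{N}(\mathbf{v}_j \mid \mu_v, \Lambda_v^{-1}) \, \mathcal{NW}(\mu_v, \Lambda_v \mid \Theta_0), \tag{3}$$

where $\mathcal{N}$ and $\mathcal{NW}$ are Normal and Normal-Wishart distributions and $\Theta_0$ are the fixed
hyperparameters of the Normal-Wishart hyperprior. Similarly to PMF, the noise model of BPMF is
Gaussian:

$$p(R \mid \mathbf{u}, \mathbf{v}, \alpha_R) = \prod_{(i,j) \in I_R} \mathcal{N}(R_{ij} \mid \mathbf{u}_i^\top \mathbf{v}_j, \alpha_R^{-1}), \tag{4}$$

where $\alpha_R > 0$ is the precision parameter and is assumed to be known. From (2)-(4), it is
straightforward to derive a block Gibbs sampler for each latent vector $\mathbf{u}_i$ and $\mathbf{v}_j$, and for the
parameters of the Gaussian priors $\mu_u$, $\Lambda_u$, $\mu_v$, $\Lambda_v$.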

BPMF is generally observed to improve predictive performance over PMF (e.g., in the
collaborative filtering task of Netflix [6]). Another advantage of BPMF is that it provides
credible intervals for the estimates. It should also be noted that BPMF can be easily
parallelized and can handle large-scale data sets, such as the Netflix challenge data, which
contains 200 million observations.

1.2 Proposed Method 

In this paper we propose Macau, a powerful and flexible method for factorization of heterogeneous
data. Its essential features are

• The ability to factorize a wide range of data models, which we represent by a hypergraph
where entities are nodes and relations are hyperedges. Supported models include the cases
of ordinary graphs and tensor relations, see Appendix A.1.

• The incorporation of features (side information) for any entity and for any relation.

• Scalability up to millions of entity instances, hundreds of millions of observations, and
sparse entity features with millions of dimensions.

We follow the approach of BPMF by proposing a Gibbs sampling scheme that includes a specially
designed noise injection step for entity and relation features. This enables the method to scale
to millions of sparse features or to tens of thousands of dense features.

The novelty of the proposed method is the combination of all of the above functionality into a
unified Bayesian framework (see Section 3 for a detailed overview). Also, in the context of matrix
factorization with entity features, Macau carries out MCMC inference rather than variational
approximations such as the Variational Bayes approach proposed in previous research [5].
We apply our method to the standard matrix factorization benchmark MovieLens, outperforming the
state-of-the-art MF approaches. Additionally, we explore the performance of Macau in a
challenging biochemistry task of drug-protein activity prediction, where we demonstrate the
effectiveness of the aforementioned characteristics of the method. This task is based on publicly
available data from ChEMBL [2]. Finally, we report runtime information for a private industrial
data set from Janssen Pharmaceutica containing millions of drug candidates with millions of sparse
features and tens of millions of observed activity values.

Our contribution includes an open source package^1 implementing all of the above mentioned
features together with multi-core and multi-node parallelization in the Julia language.

2 Macau 

In this section we outline the probabilistic model for Macau and then give the details of the
Gibbs sampler (Section 2.4), including the crucial noise injection scheme for sampling the weight
variables linking the entity and relation features (Section 2.5). An overview of related research
follows in Section 3.

2.1 Multiple relations and tensor relations 

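In practice, data sets often contain multiple relations between entities (e.g., drugs and
proteins, see Fig. 3). To handle this in Macau, we consider a relational model with a set of
entities $\mathcal{E}$ and a set of relations $\mathcal{R}$ such that each relation $R \in \mathcal{R}$ can link together two
or more entities, i.e., $R$ is a tensor. Each relation $R$ maps the instances of its entities to a
real number, denoted by $R_{\mathbf{j}}$ where $\mathbf{j} = (j_1, \ldots, j_k)$ is the index vector and $k$ is the degree
of the relation (i.e., the number of entities connected by $R$). Formally, $R$ is a map
$\mathbb{N}^k \to \mathbb{R}$. As in the case of a partially observed matrix, the values $R_{j_1,\ldots,j_k}$ are partially
observed. We denote the latent vector of instance $i \in \mathbb{N}$ of entity $e \in \mathcal{E}$ by
$\mathbf{u}_i^{(e)} \in \mathbb{R}^D$.

Each relation $R$ has a Gaussian noise model with precision $\alpha_R > 0$

$$p(R \mid \mathbf{u}, \alpha_R) = \prod_{\mathbf{j} \in I_R} \mathcal{N}\left(R_{\mathbf{j}} \mid \mathbf{1}^\top \mathbf{u}_{\mathbf{j}}^{\mathcal{E}_R}, \alpha_R^{-1}\right), \tag{5}$$

where $I_R \subset \mathbb{N}^k$ is the set of index vectors for which $R$ is observed, $\mathcal{E}_R = (e_1, \ldots, e_k)$ is the
ordered list of entities connected by relation $R$, in the same order as in the index vectors
$\mathbf{j} \in I_R$, $\mathbf{1}$ is the vector of ones and
$\mathbf{u}_{\mathbf{j}}^{\mathcal{E}_R} = \mathbf{u}_{j_1}^{(e_1)} \circ \cdots \circ \mathbf{u}_{j_k}^{(e_k)}$ is the element-wise product of the latent vectors.
The conditional probability of the observations of all relations is then

$$p(\mathcal{R} \mid \mathbf{u}, \alpha) = \prod_{R \in \mathcal{R}} \prod_{\mathbf{j} \in I_R} \mathcal{N}\left(R_{\mathbf{j}} \mid \mathbf{1}^\top \mathbf{u}_{\mathbf{j}}^{\mathcal{E}_R}, \alpha_R^{-1}\right). \tag{6}$$

Equation (6) allows Macau to simultaneously factorize more than two entities and multiple
relations with possibly different degrees.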
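The mean of the noise model in (5) can be sketched in a few lines: the latent vectors of the participating instances are multiplied element-wise and summed over the latent dimensions. The function names, the entity names and the numbers below are illustrative assumptions, not part of the Macau package.

```python
from functools import reduce


def elementwise_product(vectors):
    """Element-wise product of several latent vectors of equal length D."""
    return [reduce(lambda a, b: a * b, comps) for comps in zip(*vectors)]


def predict(latents, relation_entities, index):
    """Mean of the noise model (5): 1^T (u_{j_1} o ... o u_{j_k}).

    latents: dict entity name -> list of latent vectors, one per instance.
    relation_entities: ordered entity names linked by the relation (E_R).
    index: index vector j, one instance index per entity in E_R.
    """
    vectors = [latents[e][i] for e, i in zip(relation_entities, index)]
    return sum(elementwise_product(vectors))


# Made-up latent vectors with D = 2.
latents = {
    "drug": [[1.0, 2.0], [0.0, 1.0]],
    "protein": [[3.0, 1.0]],
    "type": [[1.0, 1.0], [2.0, 0.0]],
}
matrix_pred = predict(latents, ("drug", "protein"), (0, 0))            # degree 2
tensor_pred = predict(latents, ("drug", "protein", "type"), (0, 0, 1))  # degree 3
```

For a relation of degree 2 this reduces to the usual inner product of matrix factorization; higher degrees only add more factors to the element-wise product.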

2.2 Entity Features 

Entity features are extra information available about instances of entities, often referred to as
side information. For example, in the case of movie ratings, it could be the genre and the release
year for movies, or the age and the gender for users. In the example of drug-protein activity
modeling, it is possible to use substructure information of the drug candidate, represented by a
large sparse binary vector. The idea exploited in Macau is that we can use this extra information
to predict the latent vector of the instance and, thus, get a more accurate factorization,
especially for entities that have few or no observations.

^1 URL to the package: https://github.com/jaak-s/BayesianDataFusion.jl

First, let us write the standard Gaussian prior (2), used in BPMF, for the latent variable
$\mathbf{u}_i^{(e)}$ of an instance $i$ of entity $e$:

$$p(\mathbf{u}_i^{(e)} \mid \mu_e, \Lambda_e) = \mathcal{N}(\mathbf{u}_i^{(e)} \mid \mu_e, \Lambda_e^{-1}), \tag{7}$$

where $\mu_e$ and $\Lambda_e$ are the common prior mean and precision matrix for entity $e$, respectively. To
incorporate the instance's feature $\mathbf{x}_i^{(e)} \in \mathbb{R}^{F_e}$ we add a term $\beta_e^\top \mathbf{x}_i^{(e)}$ into the Gaussian
mean:

$$p(\mathbf{u}_i^{(e)} \mid \mathbf{x}_i^{(e)}, \mu_e, \beta_e, \Lambda_e) = \mathcal{N}(\mathbf{u}_i^{(e)} \mid \mu_e + \beta_e^\top \mathbf{x}_i^{(e)}, \Lambda_e^{-1}), \tag{8}$$

where $\beta_e \in \mathbb{R}^{F_e \times D}$ is the weight matrix for the entity features and $F_e$ is the dimensionality
of the features. Equation (8) can be interpreted as a linear model for the latent vectors. If an
instance does not have any observations then the distribution of its latent variable is fully
determined by (8), because there are no terms involving its latent variable in (6). On the other
hand, if the instance has many observations its features will have only a minor impact.
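To make the linear model in (8) concrete, the sketch below computes the prior mean of a latent vector from a sparse feature vector, which is all the model knows about an instance without observations. The name `prior_mean`, the dictionary encoding of sparse features and the numbers are assumptions for illustration only.

```python
def prior_mean(mu_e, beta_e, x_i):
    """Mean of the Gaussian prior (8): mu_e + beta_e^T x_i.

    mu_e: list of D floats.
    beta_e: F_e x D weight matrix given as a list of F_e rows.
    x_i: sparse feature vector as dict feature index -> value.
    """
    mean = list(mu_e)
    for f, value in x_i.items():  # only non-zero features contribute
        for d, weight in enumerate(beta_e[f]):
            mean[d] += value * weight
    return mean


mu_e = [0.5, -0.5]
beta_e = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]  # F_e = 3, D = 2 (made-up weights)
cold_start = prior_mean(mu_e, beta_e, {0: 1.0, 2: 1.0})  # features 0 and 2 active
no_features = prior_mean(mu_e, beta_e, {})                # falls back to mu_e
```

Iterating only over the non-zero entries keeps the cost proportional to the number of active features, which matters for binary fingerprints with very many dimensions.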


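To have a full Bayesian treatment for $\beta_e$, we introduce a zero-mean multivariate normal as its
prior:

$$p(\beta_e \mid \Lambda_e, \lambda_{\beta_e}) = \mathcal{N}\left(\mathrm{vec}(\beta_e) \mid 0, \Lambda_e^{-1} \otimes (\lambda_{\beta_e} \mathbf{I})^{-1}\right) \tag{9}$$

$$\propto \lambda_{\beta_e}^{F_e D / 2} \, |\Lambda_e|^{F_e / 2} \exp\left(-\frac{\lambda_{\beta_e}}{2} \mathrm{tr}(\beta_e^\top \beta_e \Lambda_e)\right), \tag{10}$$

where $\mathrm{vec}(\beta_e)$ is the vectorization of $\beta_e$, $\otimes$ denotes the Kronecker product and
$\lambda_{\beta_e} > 0$ is the diagonal element of the precision matrix. The inclusion of $\Lambda_e$ (the precision
matrix of the latent vectors) in (9) is crucial for deriving a computationally efficient noise
injection sampler, described in detail in Section 2.5.

As the choice of $\lambda_{\beta_e}$ is problem dependent, we set a gamma distribution as its hyperprior, as
used in a similar context for neural networks [3]:

$$p(\lambda_{\beta_e} \mid \mu, \nu) \propto \lambda_{\beta_e}^{\nu/2 - 1} \exp\left(-\frac{\nu}{2\mu} \lambda_{\beta_e}\right), \tag{11}$$

where $\mu$ and $\nu$ are fixed hyperparameters, which are both set to 1 in the experiments.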


2.3 Relation Features 

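Often there is extra information regarding observations (e.g., the number of days after the
release when the user went to see the movie, or the temperature of the chemical experiment).
However, this data is not linked to a single entity instance but instead to the observation
(e.g., a particular user-movie pair). If these features are fully observed for a particular
relation $R$, then Macau can incorporate them directly into the observation model. Let
$\mathbf{x}_{\mathbf{j}}^{(R)} \in \mathbb{R}^{F_R}$ be the relation feature for observation $R_{\mathbf{j}}$; then the previous observation
model (5) for relation $R$ is replaced by

$$p(R \mid \mathbf{u}, \alpha_R) = \prod_{\mathbf{j} \in I_R} \mathcal{N}\left(R_{\mathbf{j}} \mid \mathbf{1}^\top \mathbf{u}_{\mathbf{j}}^{\mathcal{E}_R} + \beta_R^\top \mathbf{x}_{\mathbf{j}}^{(R)}, \alpha_R^{-1}\right), \tag{12}$$

where $\beta_R \in \mathbb{R}^{F_R}$ is the weight vector. The treatment of $\beta_R$ is similar to $\beta_e$, i.e., Macau uses
a zero-mean Gaussian prior on $\beta_R$ with precision $\lambda_{\beta_R} > 0$ that has a gamma hyperprior:

$$p(\beta_R \mid \lambda_{\beta_R}) = \mathcal{N}(\beta_R \mid 0, \lambda_{\beta_R}^{-1} \mathbf{I}) \tag{13}$$

$$p(\lambda_{\beta_R} \mid \mu, \nu) = \mathcal{G}(\lambda_{\beta_R} \mid \mu, \nu), \tag{14}$$

where as before $\mu$ and $\nu$ are fixed hyperparameters, set to 1 in the experiments.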
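The observation model with relation features can be sketched as a log-density for a single observation: the tensor interaction term and the linear feature effect are added to form the Gaussian mean. The function name `log_likelihood` and the toy values are assumptions made for this illustration.

```python
import math


def log_likelihood(r_obs, latent_vectors, beta_r, x_r, alpha_r):
    """Log-density of one observation under the noise model (12)."""
    # 1^T (u_{j_1} o ... o u_{j_k}): sum over latent dimensions of the product.
    interaction = sum(math.prod(comps) for comps in zip(*latent_vectors))
    # beta_R^T x_j^(R): linear effect of the relation features.
    feature_effect = sum(b * x for b, x in zip(beta_r, x_r))
    residual = r_obs - (interaction + feature_effect)
    return 0.5 * math.log(alpha_r / (2.0 * math.pi)) - 0.5 * alpha_r * residual ** 2


u = [1.0, 2.0]
v = [3.0, 1.0]
# The mean is 5.0 (interaction) + 0.5 (feature effect) = 5.5.
ll_exact = log_likelihood(5.5, [u, v], beta_r=[0.5], x_r=[1.0], alpha_r=2.0)
ll_off = log_likelihood(6.5, [u, v], beta_r=[0.5], x_r=[1.0], alpha_r=2.0)
```

Setting `beta_r` and `x_r` to empty lists recovers the feature-free model (5), mirroring the convention that the feature term is zero when a relation has no features.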


2.4 Gibbs Sampler 

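Gibbs sampling is used to sample from the posterior of the model variables. In this section we
present the conditional distributions of the Gibbs sampler for all variables (except $\beta_e$ and
$\beta_R$, for which we propose a specially designed sampler in Section 2.5).

2.4.1 Latent vectors

Based on (8) and (12) the conditional probability for $\mathbf{u}_i^{(e)}$ is

$$p(\mathbf{u}_i^{(e)} \mid \mathcal{R}, \mathbf{u}, \mathbf{x}, \beta, \Lambda_e, \alpha, \mu_e) = \mathcal{N}\left(\mathbf{u}_i^{(e)} \mid \mu_i^{(e)*}, [\Lambda_i^{(e)*}]^{-1}\right) \tag{15}$$

$$\propto \prod_{R \in \mathcal{R}_e} \prod_{\mathbf{j} \in I_R(e,i)} \mathcal{N}\left(R_{\mathbf{j}} \mid \mathbf{1}^\top \mathbf{u}_{\mathbf{j}}^{\mathcal{E}_R} + \beta_R^\top \mathbf{x}_{\mathbf{j}}^{(R)}, \alpha_R^{-1}\right) \times \mathcal{N}\left(\mathbf{u}_i^{(e)} \mid \mu_e + \beta_e^\top \mathbf{x}_i^{(e)}, \Lambda_e^{-1}\right), \tag{16}$$

where

$$\Lambda_i^{(e)*} = \Lambda_e + \sum_{R \in \mathcal{R}_e} \sum_{\mathbf{j} \in I_R(e,i)} \alpha_R \left(\mathbf{u}_{\mathbf{j}}^{\mathcal{E}_R} \oslash \mathbf{u}_i^{(e)}\right) \left(\mathbf{u}_{\mathbf{j}}^{\mathcal{E}_R} \oslash \mathbf{u}_i^{(e)}\right)^\top,$$

$$\mu_i^{(e)*} = [\Lambda_i^{(e)*}]^{-1} \left( \Lambda_e (\mu_e + \beta_e^\top \mathbf{x}_i^{(e)}) + \sum_{R \in \mathcal{R}_e} \sum_{\mathbf{j} \in I_R(e,i)} \alpha_R \left(R_{\mathbf{j}} - \beta_R^\top \mathbf{x}_{\mathbf{j}}^{(R)}\right) \left(\mathbf{u}_{\mathbf{j}}^{\mathcal{E}_R} \oslash \mathbf{u}_i^{(e)}\right) \right),$$

where $\mathcal{R}_e$ is the set of relations that the entity $e$ is linked to, $\oslash$ denotes element-wise
division^2, and $I_R(e, i) \subset I_R$ is the set of indexes of observations to which instance $i$ of entity
$e$ is linked, i.e., all the observed data in relation $R$ for instance $i$. The above equations use
a shorthand: if relation $R$ does not have features then $\beta_R^\top \mathbf{x}_{\mathbf{j}}^{(R)}$ is 0, and similarly if entity
$e$ does not have features then $\beta_e^\top \mathbf{x}_i^{(e)}$ is 0.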
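As a sanity check on the conditional above, it helps to specialise it to a single matrix relation $R$ between rows and columns without any features. In that case the element-wise division simply removes the row's own latent vector from the product and leaves the column vector $\mathbf{v}_j$. The following restatement is our own illustration of that special case, under the assumption that row $i$ is linked only to $R$:

```latex
\Lambda_i^{*} = \Lambda_u + \alpha_R \sum_{j : (i,j) \in I_R} \mathbf{v}_j \mathbf{v}_j^\top,
\qquad
\mu_i^{*} = [\Lambda_i^{*}]^{-1} \Big( \Lambda_u \mu_u + \alpha_R \sum_{j : (i,j) \in I_R} R_{ij} \, \mathbf{v}_j \Big).
```

This is the familiar form of the BPMF row update, so Macau's latent-vector step can be read as that update with the prior mean shifted by the entity features and the targets shifted by the relation features.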


2.4.2 Gaussian priors 

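Macau also uses the same Normal-Wishart hyperprior for $\mu_e$ and $\Lambda_e$ as BPMF [6]:

$$p(\mu_e, \Lambda_e \mid \Theta_0) = \mathcal{N}(\mu_e \mid \mu_0, (\beta_0 \Lambda_e)^{-1}) \, \mathcal{W}(\Lambda_e \mid W_0, \nu_0), \tag{17}$$

where the hyperparameters are set to uninformative values of $\mu_0 = 0$, $\beta_0 = 2$, $W_0 = \mathbf{I}$ (the
identity matrix), and $\nu_0 = D$. Combining the hyperprior (17) with (8) we get the conditional
probability

$$p(\mu_e, \Lambda_e \mid \mathbf{u}^{(e)}, \mathbf{x}, \beta, \Theta_0) = \mathcal{N}(\mu_e \mid \mu_0^*, (\beta_0^* \Lambda_e)^{-1}) \, \mathcal{W}(\Lambda_e \mid W_0^*, \nu_0^*). \tag{18}$$

Because of space constraints the formulas for $\mu_0^*$, $\beta_0^*$, $W_0^*$, $\nu_0^*$ are presented in
Appendix A.2. One of the essential differences compared to BPMF is that in BPMF the Gaussian priors
model the latent vectors $\mathbf{u}_i^{(e)}$, whereas in Macau they model the residual
$\mathbf{u}_i^{(e)} - \beta_e^\top \mathbf{x}_i^{(e)}$.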


2.4.3 Precision parameter for the weight vector 

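From (9) and (11) we can derive the conditional probability for $\lambda_{\beta_e}$ as

$$p(\lambda_{\beta_e} \mid \beta_e, \Lambda_e, \mu, \nu) = \mathcal{G}(\lambda_{\beta_e} \mid \mu^*, \nu^*), \tag{19}$$

where

$$\nu^* = F_e D + \nu, \qquad \mu^* = \frac{(F_e D + \nu)\,\mu}{\nu + \mu \, \mathrm{tr}(\beta_e^\top \beta_e \Lambda_e)}.$$

The conditional probability for $\lambda_{\beta_R}$ is analogous and is described in Appendix A.3.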
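The update in (19) only needs the trace term and two scalars, which the following sketch computes before drawing a sample. The gamma density in (11) has shape $\nu/2$ and mean $\mu$, which determines the arguments passed to the standard-library sampler. Function names and toy inputs are assumptions for illustration.

```python
import random


def lambda_beta_posterior(beta_e, Lambda_e, mu, nu):
    """Parameters (mu*, nu*) of the gamma conditional (19).

    beta_e: F_e x D weight matrix, Lambda_e: D x D precision matrix,
    both given as nested lists.
    """
    F, D = len(beta_e), len(Lambda_e)
    # tr(beta^T beta Lambda) equals the sum over rows b of b^T Lambda b.
    trace = sum(
        row[a] * Lambda_e[a][b] * row[b]
        for row in beta_e for a in range(D) for b in range(D)
    )
    nu_star = F * D + nu
    mu_star = nu_star * mu / (nu + mu * trace)
    return mu_star, nu_star


def sample_lambda_beta(mu_star, nu_star, rng=random):
    """Draw from the density proportional to x^(nu*/2 - 1) exp(-nu* x / (2 mu*))."""
    # Shape nu*/2 and scale 2 mu*/nu*, so that the mean equals mu*.
    return rng.gammavariate(nu_star / 2.0, 2.0 * mu_star / nu_star)


beta_e = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # F_e = 3, D = 2 (made-up weights)
Lambda_e = [[2.0, 0.0], [0.0, 1.0]]
mu_star, nu_star = lambda_beta_posterior(beta_e, Lambda_e, mu=1.0, nu=1.0)
draw = sample_lambda_beta(mu_star, nu_star, random.Random(0))
```

Large weights increase the trace term and therefore lower $\mu^*$, so the sampler automatically relaxes the shrinkage when the data supports large feature effects.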


2.5 Noise Injection Sampler 

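From (8) and (9) we can write out the conditional probability for $\beta_e$:

$$p(\beta_e \mid \mu_e, \Lambda_e, \mathbf{u}, \mathbf{x}, \lambda_{\beta_e}) \propto \exp\left( -\frac{1}{2} \sum_i \left(\mathbf{u}_i^{(e)} - \mu_e - \beta_e^\top \mathbf{x}_i^{(e)}\right)^\top \Lambda_e \left(\mathbf{u}_i^{(e)} - \mu_e - \beta_e^\top \mathbf{x}_i^{(e)}\right) - \frac{1}{2} \lambda_{\beta_e} \mathrm{tr}(\beta_e \Lambda_e \beta_e^\top) \right). \tag{20}$$

Let us denote $\mathbf{U} = [\mathbf{u}_1^{(e)} - \mu_e, \ldots, \mathbf{u}_{N_e}^{(e)} - \mu_e]^\top$ and
$\mathbf{X} = [\mathbf{x}_1^{(e)}, \ldots, \mathbf{x}_{N_e}^{(e)}]^\top$; then, because both the likelihood and the prior contain
$\Lambda_e$, we can factorize $\Lambda_e$ out:

$$p(\beta_e \mid \mu_e, \Lambda_e, \mathbf{u}, \mathbf{x}, \lambda_{\beta_e}) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left[ \left( (\mathbf{U} - \mathbf{X}\beta_e)^\top (\mathbf{U} - \mathbf{X}\beta_e) + \lambda_{\beta_e} \beta_e^\top \beta_e \right) \Lambda_e \right] \right) \tag{21}$$

and the Gaussian mean and precision can be derived (for details see Appendix A.4):

$$p(\beta_e \mid \mu_e, \Lambda_e, \mathbf{u}, \mathbf{x}, \lambda_{\beta_e}) \propto \exp\left( -\frac{1}{2} \mathrm{vec}(\beta_e - \hat{\beta}_e)^\top \left( \Lambda_e \otimes (\mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I}) \right) \mathrm{vec}(\beta_e - \hat{\beta}_e) \right), \tag{22}$$

where $\hat{\beta}_e = (\mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{U}$ is the mean and
$\Lambda_e \otimes (\mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I})$ is the precision of the posterior. However, even for moderate
feature dimensions $F_e$ the standard sampling of the multivariate Gaussian is computationally
intractable because the size of the precision matrix is $D F_e \times D F_e$.

^2 The formula assumes that entity $e$ is present only once in $\mathcal{E}_R$. To handle cases where, e.g.,
$\mathcal{E}_R = (e, e)$ and there are observations on the diagonal, the equations should be modified.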

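By exploiting the Kronecker product structure of the precision matrix and the presence of
$(\mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I})$ in both the mean and the precision, we derive an alternative approach. A
sample $\tilde{\beta}_e$ from (22) can be obtained by solving a linear system for $\tilde{\beta}_e$:

$$(\mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I}) \, \tilde{\beta}_e = \mathbf{X}^\top (\mathbf{U} + \mathbf{E}_1) + \sqrt{\lambda_{\beta_e}} \, \mathbf{E}_2, \tag{23}$$

where each row of the matrices $\mathbf{E}_1 \in \mathbb{R}^{N_e \times D}$ and $\mathbf{E}_2 \in \mathbb{R}^{F_e \times D}$ is sampled from
$\mathcal{N}(0, \Lambda_e^{-1})$. The correctness of (23) is proven in Appendix A.5. The derivation of the noise
injection sampler for the weight vector $\beta_R$ of relation features is analogous, with the
difference that the linear system has only a single right-hand side.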
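The noise injection step can be sketched with NumPy as follows. The code perturbs the regression targets and the right-hand side with Gaussian noise whose rows have covariance $\Lambda_e^{-1}$ and then solves one ridge-type linear system with $D$ right-hand sides. This is a minimal dense sketch under our own naming (`solve_noise_injected`, `sample_beta`) and with random toy data; it is not the implementation of the Julia package.

```python
import numpy as np


def solve_noise_injected(X, U, lam, E1, E2):
    """Solve (X^T X + lam I) beta = X^T (U + E1) + sqrt(lam) E2, as in (23)."""
    F = X.shape[1]
    A = X.T @ X + lam * np.eye(F)
    rhs = X.T @ (U + E1) + np.sqrt(lam) * E2
    return np.linalg.solve(A, rhs)  # all D right-hand sides at once


def sample_beta(X, U, Lambda_e, lam, rng):
    """One draw of the F x D weight matrix via noise injection."""
    N, F = X.shape
    D = U.shape[1]
    # Rows of E1 and E2 are N(0, Lambda_e^{-1}); use a Cholesky factor of the covariance.
    L = np.linalg.cholesky(np.linalg.inv(Lambda_e))
    E1 = rng.standard_normal((N, D)) @ L.T
    E2 = rng.standard_normal((F, D)) @ L.T
    return solve_noise_injected(X, U, lam, E1, E2)


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))   # N = 50 instances, F = 4 features
U = rng.standard_normal((50, 3))   # centred latent vectors, D = 3
Lambda_e = np.diag([1.0, 2.0, 4.0])
beta_sample = sample_beta(X, U, Lambda_e, lam=0.5, rng=rng)
# With the injected noise switched off, the solution is the posterior mean of (22).
beta_mean = solve_noise_injected(X, U, 0.5, np.zeros((50, 3)), np.zeros((4, 3)))
```

The design point is that the system matrix has size $F_e \times F_e$ instead of $D F_e \times D F_e$, so no factorization of the full Kronecker-structured precision is ever required.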

Thus, to sample $\beta_e$ we need to solve a linear system of size $F_e \times F_e$ with $D$ different
right-hand sides. If $F_e$ is of medium size (up to 20,000) we propose to use direct solvers^3. If
$\mathbf{X}$ is sparse we can tackle high-dimensional systems by solving each right-hand side separately
with the iterative conjugate gradient (CG) method. CG only requires multiplication of $\mathbf{X}$ (and
$\mathbf{X}^\top$) with a vector and can handle cases where $F_e$ is in the order of millions.
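The CG variant can be sketched as below. The operator is applied as two matrix-vector products, so the $F_e \times F_e$ matrix is never formed. For brevity the sketch uses a small dense NumPy array for the feature matrix; in a real setting a sparse matrix type with the same product operations would take its place. The function name `cg_solve`, the tolerance and the toy data are assumptions.

```python
import numpy as np


def cg_solve(X, lam, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient for (X^T X + lam I) beta = b using only products with X and X^T."""
    def apply_A(v):
        return X.T @ (X @ v) + lam * v  # never forms X^T X explicitly

    beta = np.zeros_like(b)
    r = b - apply_A(beta)   # initial residual
    p = r.copy()            # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ap = apply_A(p)
        step = rs_old / (p @ Ap)
        beta += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return beta


rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
rhs = rng.standard_normal((5, 3))  # one column per latent dimension
# Each right-hand side is solved separately, so the D solves are independent.
beta = np.column_stack([cg_solve(X, 0.5, rhs[:, d]) for d in range(3)])
```

Because the per-column solves share nothing but the read-only feature matrix, they can be handed to separate workers, which matches the distribution of CGs over processes described in Section 4.3.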

3 Related Research 

Several extensions of BPMF that are related to our work have already been proposed. Bayesian
Probabilistic Tensor Factorization [9] extends BPMF to the factorization of a single 3-way
relation (tensor) without entity and relation features. Like BPMF, their approach uses a Gibbs
sampler.

Singh and Gordon [7] propose a Bayesian MF method that can link together more than one 2-way
relation (matrix). Their sampling approach is analogous to BPMF, except that it uses Hamiltonian
Monte Carlo within Gibbs, i.e., each latent vector is sampled separately by Hamiltonian Monte
Carlo. Their method does not support tensors, entity features or relation features. The lack of
support for tensors also means their method cannot support multiple relations between two
entities, which is useful, for example, in the case of drug-protein activity modeling (see
Section 4.2 for the experiment on ChEMBL data with two relations between potential drugs and
proteins).

Hierarchical Bayesian Matrix Factorization with Side Information (HBMFSI) [5] is a method for the
special case of factorizing a single matrix with entity features based on Variational Bayes.
HBMFSI does not allow the model to use relation features as in Macau. However, they propose to add
the concatenation of row entity features and column entity features in the same way as relation
features in Macau, i.e., $\mathbf{x}_{ij}^{(R)} = (\mathbf{x}_i^{(\mathrm{row})}, \mathbf{x}_j^{(\mathrm{col})})$.

4 Experiments 

This section gives results for 1) the standard MF benchmark MovieLens and 2) a challenging
biochemical problem based on the ChEMBL data set [2], and reports runtimes of Macau on a
large-scale industrial drug-protein activity data set. The performance reported is mean RMSE and
the error bars in figures represent standard deviations. All experiments are repeated 10 times.

4.1 MovieLens Benchmark 

The MovieLens data set consists of a single matrix of movie-user ratings from 6,040 users and
3,952 movies. There are a total of 1,000,209 ratings taking values from 1.0 to 5.0. Recent
research [1] has investigated in detail the noise level in movie ratings. Amatriain et al. made a
conservative estimate of a between-trial RMSE of 0.8156. Based on that estimate, we chose to use
$\alpha = 1.5$ in our experiments. The data set contains 29- and 18-dimensional entity features for
users and movies, respectively. We compare Macau against HBMFSI^4, which is the state-of-the-art
MF approach with entity features, using the same evaluation setup as in their paper [5]. Namely,
one half of the ratings is randomly chosen as the test set and the other half as the training set.
The methods compared are 1) Macau-E: Macau with entity features, 2) Macau-ER: Macau with entity
features and relation features constructed as in HBMFSI (see Section 3), 3) HBMFSI, and 4) BPMF.
All methods use latent dimension D = 30. It should be noted that the relative performance of the
methods was similar when we used D = 10 (data not shown). The results show that both Macau setups
outperform HBMFSI and BPMF, see Figure 1.

^3 On a system with 2 Intel Xeon E5-2699 v3 CPUs using 8 cores, Julia takes 25 seconds to solve a
20,000 × 20,000 system with 60 right-hand sides (= latent dimensions).




Figure 1: Results for MovieLens experiments. The p-value of the two-sided t-test between Macau-ER
and HBMFSI is lower than 0.0001.

Figure 2: Results for ChEMBL experiments. BPMF and Macau use D = 30.


4.2 ChEMBL drug-protein activity prediction 

The prediction of drug and protein interactions is crucial for the development of new drugs. In
this case study we focus on the interaction measure IC50, which measures the concentration of the
drug necessary to inhibit the activity of the protein by 50%. We prepared a data set from the
public bioactivity database ChEMBL [2], Version 19. First, we selected proteins that had at least
200 IC50 measurements, and then we kept drugs with 3 or more IC50 measurements. Finally, we
filtered out some measurements with clear data errors (these were also reported to ChEMBL). The
final numbers of small molecules and proteins are 15,073 and 346, respectively, with a total of
59,280 IC50 measurements. In all of the ChEMBL experiments we model $\log_{10}$ of IC50 and set
$\alpha = 5.0$, because this corresponds to a reasonable standard deviation of 0.45^5. For drugs, we
use sparse features (substructure fingerprint) with $F_{\mathrm{drug}} = 105{,}672$; for proteins, we use dense
features (based on protein sequence) with $F_{\mathrm{prot}} = 20$.
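The two precision settings used in the experiments follow from the relation between a Gaussian precision and its standard deviation, $\alpha = 1/\sigma^2$. The short sketch below reproduces the arithmetic; the helper names are our own.

```python
def precision_from_std(sigma):
    """Gaussian precision alpha = 1 / sigma^2 for a noise standard deviation sigma."""
    return 1.0 / sigma ** 2


def std_from_precision(alpha):
    """Inverse mapping: sigma = alpha^(-1/2)."""
    return alpha ** -0.5


alpha_movielens = precision_from_std(0.8156)  # between-trial RMSE quoted from [1]
sigma_chembl = std_from_precision(5.0)        # precision used for the ChEMBL models
```

The first value rounds to the 1.5 chosen for MovieLens, and the second rounds to the 0.45 standard deviation quoted for the ChEMBL models.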

In the first experiment, we compare Macau with entity features for drugs and proteins to BPMF, as
well as to individual ridge regressions based on drug features, one for each protein. Macau and
BPMF use 30 latent dimensions, because we observed this to be sufficient for good performance, see
Appendix A.6. To tune the regularization parameter of the ridge regression of each protein, we
used 5-fold inner cross-validation. A test set containing 20% of the observations is chosen at
random. The strong performance of Macau over BPMF, as seen in Figure 2, is expected because Macau
gives the most advantage when the relation is sparsely observed.

In the second experiment we want to improve IC50 predictions by introducing a new relation, Ki,
between drugs and proteins. Ki measures the binding affinity of the drug for a protein. While
related to IC50, it measures a different biochemical aspect of the interaction. We thus expect it
to contribute additional information for our task. In Macau, multiple relations between two
entities are represented as a tensor by creating a third entity denoting the type of interaction,
see Figure 3. The dimensions of the tensor are 15,073 × 346 × 2, and the Ki part contains 5,121
observations. As before, the test set contains measurements only from the IC50 part. As can be
seen from Figure 4, the tensor model IC50+Ki significantly outperforms the single-relation model
with only IC50 (p < 0.0001).

^4 In the experiments we used the MATLAB implementation of HBMFSI provided by the authors.

^5 It is possible to enhance the performance by tuning/sampling $\alpha$.

Figure 3: The IC50+Ki model has two types of relations between entity Drug and entity Protein.

Figure 4: Comparison of IC50 and IC50+Ki Macau models using D = 30.

Figure 5: The IC50+Pheno model has three entities and two relations.

Figure 6: Comparison of IC50 and IC50+Pheno models.

In the final experiment, we explore the effect of connecting drugs to two relations. For the drugs
in our IC50 data set, we compiled all cancer assays from ChEMBL that had at least 20 compounds
in that set. From these, we created a new relation Pheno with 7,552 observations measuring the
phenotypic effect of drugs in 182 assays, which is depicted in Figure 5. Because the effects of
assays are not directly linked to any specific protein, we expected a weaker effect than from the
Ki data. Therefore, for the comparison of the IC50 and IC50+Pheno models, the 20% test set of IC50
measurements is selected only from the 1,678 compounds that have Pheno observations. In Figure 6
we can observe that the IC50+Pheno model outperforms the IC50 model when an appropriately large
latent dimension is used. It is interesting to note that with D = 10 the IC50+Pheno model performs
slightly worse, which may be evidence that, with too small a D, adding more relations can result
in an overcrowded latent space.

4.3 Runtime on Large-scale Industrial Data Set 

For large-scale problems, our implementation has multi-core and multi-node parallelization. The
sampling of latent vectors can be parallelized straightforwardly as the latent vectors of a single
entity are, in our use cases, independent of each other and can be sampled in parallel. The only
difference here compared to BPMF is that, for entities that have features, Macau requires computing
$\beta_e^\top \mathbf{x}_i^{(e)}$, for which our implementation provides parallelization as well.

In the parallelization of (23), if $F_e < 20{,}000$, the direct solver is fast enough not to require
additional parallelization. As mentioned, in the case of sparse features Macau uses CG to solve
(23) for each right-hand side separately. For each CG, our implementation parallelizes the matrix
product operations in a multi-core way, and the CGs can be distributed across multiple processes
and thus parallelized over multiple nodes.

The large-scale data set is a subset of a proprietary data set from Janssen Pharmaceutica
containing millions of compounds. The subset has more than 1.8M compounds and more than 1,000
proteins for a total of several tens of millions of compound-protein measurements. Here we report
the computation times of Macau on two types of features for the compounds using systems with 2
Intel Xeon E5-2699 v3 CPUs. Firstly, for the feature dimension $F_e \approx 6{,}000$ and D = 30, the
computation of the full Gibbs step using 8 cores of a single node takes about 40 seconds (using a
direct solver for (23)). Secondly, for the feature dimension $F_e \approx 4{,}000{,}000$ with sparsity 0.002%
and D = 30, the computation of the full Gibbs step using 10 cores per CG and a total of 15 nodes
(2 CGs per node) takes about 600 seconds. We observed that 1,000 Gibbs iterations (of which 800
were discarded as burn-in) were sufficient to reach a stable posterior.
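A common way to turn such a chain into predictions, and the one we assume here since the text does not spell it out, is to average the test predictions over the samples retained after burn-in and to score the average with RMSE. The helper names and the tiny chain below are illustrative only.

```python
import math


def posterior_mean_predictions(sampled_predictions, burn_in):
    """Average test predictions over the Gibbs samples kept after burn-in.

    sampled_predictions: one list of test predictions per Gibbs iteration.
    """
    kept = sampled_predictions[burn_in:]
    if not kept:
        raise ValueError("burn_in discards every sample")
    n_test = len(kept[0])
    return [sum(s[t] for s in kept) / len(kept) for t in range(n_test)]


def rmse(predictions, targets):
    """Root mean squared error, the metric reported in Section 4."""
    squared = sum((p - y) ** 2 for p, y in zip(predictions, targets))
    return math.sqrt(squared / len(targets))


# Toy chain: 5 iterations, the first 3 treated as burn-in (made-up values).
chain = [[9.0, 9.0], [7.0, 1.0], [5.0, 3.0], [1.0, 2.0], [3.0, 4.0]]
mean_pred = posterior_mean_predictions(chain, burn_in=3)
error = rmse(mean_pred, [2.0, 2.0])
```

With the setting reported above, `burn_in` would be 800 and the chain would hold 1,000 entries, leaving 200 samples in the average.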


5 Conclusion 

To the best of our knowledge, this paper proposes the first Bayesian factorization method that
can handle tensors, multiple relations, and entity and relation features.
A Appendices

A.1 Data Models Supported by Macau 

Let $M$ be a Macau model (i.e., hypergraph) specifying the entities and their relations. We say
$M$ is factorizable if, given a large enough latent dimension D, we can choose latent vectors for
every entity in such a way that they can fit any relation data with arbitrarily small error.

However, it is clear that some models are not factorizable. For example, consider two entities
$e_1$ and $e_2$, with the latent vectors $\mathbf{u}$ and $\mathbf{v}$, respectively, and two relations, $R$ and $S$,
between them. Since both relations are modelled by the same formula $\mathbf{u}_i^\top \mathbf{v}_j$, it is not possible
to fit arbitrary data. In fact, we can only fit the case when the two relations are equal (i.e.,
$R_{ij} = S_{ij}$).

Let us define the relation $R$ in model $M$ to be factorizable if in the single latent dimensional
case (D = 1) it is possible to assign arbitrary values to the latent variables of its entities,
$e \in \mathcal{E}_R$, while keeping the predictions for all other relations in the model $M$ equal to 0. It
is straightforward to see that if $R$ is factorizable we can fit any observed data of the relation
$R$ by adding new latent dimensions without affecting the predictions for other relations.
Additionally, the fact that all relations of $M$ are factorizable implies $M$ is factorizable,
because we can always add new latent dimensions that only affect a specific relation and, thus,
fit all relations as accurately as needed.

It is easy to show that if all pairs of entities in the hypergraph (Macau model) $M$ have at most
one hyperedge (relation) between them, the model $M$ is factorizable. To see that this is true,
consider a relation $R$ in such a model. $R$ is factorizable because if we set the latent variables
of the non-participating entities, $e \notin \mathcal{E}_R$, equal to 0, then the predictions of the other
relations will be zero as they all contain at least one non-participating entity.

From this we can see that Macau can factorize any 

• ordinary undirected graph, 

• acyclic hypergraph. 

Additionally, it is also possible to tensorize simple cases when there are multiple edges between
two entities. For example, the model IC50+Ki in our paper has two relations, namely IC50 and Ki,
between the entities Drug and Protein. To handle it in Macau we represent it as a tensor with
three modes: Drug, Protein, Type, where the third mode specifies the relation (either IC50 or Ki).
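The tensorization can be sketched as a simple re-indexing: each observation of either relation becomes a cell of a three-mode tensor whose last index identifies the relation. The second half of the sketch shows why this helps: with a latent vector per type, the same drug-protein pair can receive different predictions for IC50 and Ki. Function names, the data layout and all values are made up for illustration.

```python
def tensorize(relations):
    """Stack several drug-protein relations into one 3-mode tensor.

    relations: dict relation name -> {(drug, protein): value}.
    Returns the list of type names and {(drug, protein, type_index): value}.
    """
    types = sorted(relations)
    cells = {}
    for t, name in enumerate(types):
        for (drug, protein), value in relations[name].items():
            cells[(drug, protein, t)] = value
    return types, cells


def predict(u, v, t):
    """Tensor prediction 1^T (u o v o t) for one (drug, protein, type) cell."""
    return sum(a * b * c for a, b, c in zip(u, v, t))


relations = {
    "IC50": {(0, 0): 6.1, (1, 0): 5.2},
    "Ki": {(0, 0): 7.3},
}
types, cells = tensorize(relations)

# The same drug-protein pair gets a different prediction per relation type,
# which two plain matrix relations sharing u and v could not express.
u, v = [1.0, 2.0], [3.0, 1.0]
type_vectors = {"IC50": [1.0, 1.0], "Ki": [2.0, 0.5]}
pred_ic50 = predict(u, v, type_vectors["IC50"])
pred_ki = predict(u, v, type_vectors["Ki"])
```

The type vector acts as a per-dimension reweighting of the shared drug-protein interaction, so the two relations can share structure while still differing in value.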


A.2 Sampling of Gaussian Priors of the Latent Variables 


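In Macau $\mu_e$ and $\Lambda_e$ are interpreted as the model for the residuals
$\mathbf{u}_i^{(e)} - \beta_e^\top \mathbf{x}_i^{(e)}$. The conditional joint probability of $\mu_e$, $\Lambda_e$ used in the sampler is

$$p(\mu_e, \Lambda_e \mid \mathbf{u}^{(e)}, \mathbf{x}, \beta, \Theta_0) = \mathcal{N}(\mu_e \mid \mu_0^*, (\beta_0^* \Lambda_e)^{-1}) \, \mathcal{W}(\Lambda_e \mid W_0^*, \nu_0^*), \tag{24}$$

where

$$\mu_0^* = \frac{\beta_0 \mu_0 + N_e \bar{\mathbf{U}}}{\beta_0 + N_e}, \tag{25}$$

$$\beta_0^* = \beta_0 + N_e, \tag{26}$$

$$\nu_0^* = \nu_0 + N_e + F_e, \tag{27}$$

$$[W_0^*]^{-1} = W_0^{-1} + N_e \bar{\mathbf{S}} + \beta_0 \mu_0 \mu_0^\top - \beta_0^* \mu_0^* \mu_0^{*\top} + \lambda_{\beta_e} \beta_e^\top \beta_e, \tag{28}$$

$$\bar{\mathbf{U}} = \frac{1}{N_e} \sum_{i=1}^{N_e} \left(\mathbf{u}_i^{(e)} - \beta_e^\top \mathbf{x}_i^{(e)}\right), \tag{29}$$

$$\bar{\mathbf{S}} = \frac{1}{N_e} \sum_{i=1}^{N_e} \left(\mathbf{u}_i^{(e)} - \beta_e^\top \mathbf{x}_i^{(e)}\right) \left(\mathbf{u}_i^{(e)} - \beta_e^\top \mathbf{x}_i^{(e)}\right)^\top, \tag{30}$$

where $N_e$ is the number of instances of entity $e$. Note that the terms $\lambda_{\beta_e} \beta_e^\top \beta_e$ in (28)
and $F_e$ in (27) arise from the dependence of the prior of $\beta_e$ on $\Lambda_e$, see (9).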
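A.3 Gibbs Sampling of Precision Parameter of Weight Vector of Relation Features

Recall that $\lambda_{\beta_R}$ is the diagonal value of the precision in the prior for the weight vector
$\beta_R$ and that the hyperprior of $\lambda_{\beta_R}$ is a gamma distribution with fixed parameters $\mu$ and $\nu$:

$$p(\beta_R \mid \lambda_{\beta_R}) = \mathcal{N}(\beta_R \mid 0, \lambda_{\beta_R}^{-1} \mathbf{I}) \tag{31}$$

$$p(\lambda_{\beta_R} \mid \mu, \nu) = \mathcal{G}(\lambda_{\beta_R} \mid \mu, \nu), \tag{32}$$

where

$$\mathcal{G}(\lambda_{\beta_R} \mid \mu, \nu) \propto \lambda_{\beta_R}^{\nu/2 - 1} \exp\left(-\frac{\nu}{2\mu} \lambda_{\beta_R}\right). \tag{33}$$

The conditional probability is then

$$p(\lambda_{\beta_R} \mid \beta_R, \mu, \nu) \propto p(\beta_R \mid \lambda_{\beta_R}) \, p(\lambda_{\beta_R} \mid \mu, \nu) \tag{34}$$

$$\propto \lambda_{\beta_R}^{F_R/2} \exp\left(-\frac{\lambda_{\beta_R}}{2} \beta_R^\top \beta_R\right) \lambda_{\beta_R}^{\nu/2 - 1} \exp\left(-\frac{\nu}{2\mu} \lambda_{\beta_R}\right) \tag{35}$$

$$\propto \lambda_{\beta_R}^{(F_R + \nu)/2 - 1} \exp\left(-\lambda_{\beta_R} \frac{\nu + \mu \beta_R^\top \beta_R}{2\mu}\right) \tag{36}$$

$$\propto \lambda_{\beta_R}^{(F_R + \nu)/2 - 1} \exp\left(-\lambda_{\beta_R} \frac{(\nu + \mu \beta_R^\top \beta_R)(F_R + \nu)}{2 (F_R + \nu) \mu}\right) \tag{37}$$

$$\propto \mathcal{G}(\lambda_{\beta_R} \mid \mu^*, \nu^*), \tag{38}$$

where

$$\nu^* = F_R + \nu, \tag{39}$$

$$\mu^* = \frac{(F_R + \nu)\,\mu}{\nu + \mu \beta_R^\top \beta_R}, \tag{40}$$

where $F_R$ is the dimensionality of the relation features of $R$.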

A.4 Derivation of Gaussian Mean and Precision for the Weight Vector of the Entity Features 


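The conditional probability for $\beta_e$ is

$$p(\beta_e \mid \mu_e, \Lambda_e, \mathbf{u}, \mathbf{x}, \lambda_{\beta_e}) \tag{41}$$

$$\propto \exp\left( -\frac{1}{2} \mathrm{tr}\left[ \left( (\mathbf{U} - \mathbf{X}\beta_e)^\top (\mathbf{U} - \mathbf{X}\beta_e) + \lambda_{\beta_e} \beta_e^\top \beta_e \right) \Lambda_e \right] \right). \tag{42}$$

Next we use the link $\mathbf{X}^\top \mathbf{U} = (\mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I}) \hat{\beta}_e$ (from the definition of $\hat{\beta}_e$) and
expand the inner part of (42), writing $\mathbf{A} = \mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I}$ for brevity:

$$(\mathbf{U} - \mathbf{X}\beta_e)^\top (\mathbf{U} - \mathbf{X}\beta_e) + \lambda_{\beta_e} \beta_e^\top \beta_e \tag{43}$$

$$= \mathbf{U}^\top \mathbf{U} + \beta_e^\top \mathbf{X}^\top \mathbf{X} \beta_e - \mathbf{U}^\top \mathbf{X} \beta_e - \beta_e^\top \mathbf{X}^\top \mathbf{U} + \lambda_{\beta_e} \beta_e^\top \beta_e \tag{44}$$

$$= \mathbf{U}^\top \mathbf{U} + \beta_e^\top \mathbf{A} \beta_e - \mathbf{U}^\top \mathbf{X} \beta_e - \beta_e^\top \mathbf{X}^\top \mathbf{U} \tag{45}$$

$$= \mathbf{U}^\top \mathbf{U} + \beta_e^\top \mathbf{A} \beta_e - \hat{\beta}_e^\top \mathbf{A} \beta_e - \beta_e^\top \mathbf{A} \hat{\beta}_e \tag{46}$$

$$= \mathbf{U}^\top \mathbf{U} + \beta_e^\top \mathbf{A} \beta_e - \hat{\beta}_e^\top \mathbf{A} \beta_e - \beta_e^\top \mathbf{A} \hat{\beta}_e + \hat{\beta}_e^\top \mathbf{A} \hat{\beta}_e - \hat{\beta}_e^\top \mathbf{A} \hat{\beta}_e \tag{47}$$

$$= (\beta_e - \hat{\beta}_e)^\top \mathbf{A} (\beta_e - \hat{\beta}_e) + \underbrace{\mathbf{U}^\top \mathbf{U} - \hat{\beta}_e^\top \mathbf{A} \hat{\beta}_e}_{\text{const}}. \tag{48}$$

Next we plug the non-constant part back into (42):

$$p(\beta_e \mid \mu_e, \Lambda_e, \mathbf{u}, \mathbf{x}, \lambda_{\beta_e}) \tag{49}$$

$$\propto \exp\left( -\frac{1}{2} \mathrm{tr}\left[ \left( (\beta_e - \hat{\beta}_e)^\top (\mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I}) (\beta_e - \hat{\beta}_e) \right) \Lambda_e \right] \right) \tag{50}$$

$$\propto \exp\left( -\frac{1}{2} \mathrm{vec}(\beta_e - \hat{\beta}_e)^\top \left( \Lambda_e \otimes (\mathbf{X}^\top \mathbf{X} + \lambda_{\beta_e} \mathbf{I}) \right) \mathrm{vec}(\beta_e - \hat{\beta}_e) \right), \tag{51}$$

where we can clearly see the precision and mean of $\beta_e$.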
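A.5 Correctness of Noise Injection Sampler

Let $\mathbf{X}$, $\mathbf{U}$, $\Lambda_e$ be the matrices described in Section 2.5 on sampling $\beta_e$. Here we prove a
more general version of the sampler where instead of the precision matrix $\lambda_{\beta_e} \mathbf{I}$ we allow any
positive definite matrix $\Lambda \in \mathbb{R}^{F_e \times F_e}$. Let $\sqrt{\Lambda}$ be a matrix such that
$\sqrt{\Lambda} \sqrt{\Lambda}^\top = \Lambda$.

Lemma A.1. Let $\mathbf{E}_1 \in \mathbb{R}^{N_e \times D}$ and $\mathbf{E}_2 \in \mathbb{R}^{F_e \times D}$ be matrices whose rows are each
independently generated from $\mathcal{N}(0, \Lambda_e^{-1})$ and let the variable $\tilde{\beta}$ be the solution to the
linear system

$$(\mathbf{X}^\top \mathbf{X} + \Lambda) \tilde{\beta} = \mathbf{X}^\top (\mathbf{U} + \mathbf{E}_1) + \sqrt{\Lambda} \, \mathbf{E}_2; \tag{52}$$

then $\mathrm{vec}(\tilde{\beta})$ follows a multivariate Gaussian distribution with mean
$\mathrm{vec}((\mathbf{X}^\top \mathbf{X} + \Lambda)^{-1} \mathbf{X}^\top \mathbf{U})$ and precision $\Lambda_e \otimes (\mathbf{X}^\top \mathbf{X} + \Lambda)$.

Proof. From

$$\tilde{\beta} = (\mathbf{X}^\top \mathbf{X} + \Lambda)^{-1} \left( \mathbf{X}^\top (\mathbf{U} + \mathbf{E}_1) + \sqrt{\Lambda} \, \mathbf{E}_2 \right) \tag{53}$$

it is clear that $\tilde{\beta}$ is Gaussian distributed, as it is constructed by affine transformations and
sums of Gaussian variables. As $\mathbf{E}_1$ and $\mathbf{E}_2$ have zero mean we get

$$\mathrm{E}[\tilde{\beta}] = (\mathbf{X}^\top \mathbf{X} + \Lambda)^{-1} \left( \mathbf{X}^\top \mathrm{E}[\mathbf{U} + \mathbf{E}_1] + \sqrt{\Lambda} \, \mathrm{E}[\mathbf{E}_2] \right) \tag{54}$$

$$= (\mathbf{X}^\top \mathbf{X} + \Lambda)^{-1} \mathbf{X}^\top \mathbf{U}, \tag{55}$$

proving the correctness of the mean.

For the precision we investigate the covariance between columns $i$ and $j$ of $\tilde{\beta}$. In what
follows we use the notation $\mathbf{A}_i$ to denote column $i$ of a matrix $\mathbf{A}$. Let us also denote
$\mathbf{K} = (\mathbf{X}^\top \mathbf{X} + \Lambda)^{-1}$, giving us $\mathrm{E}[\tilde{\beta}] = \mathbf{K} \mathbf{X}^\top \mathbf{U}$; then

$$\mathrm{cov}(\tilde{\beta}_i, \tilde{\beta}_j) = \mathrm{E}\left[ (\tilde{\beta}_i - \mathrm{E}[\tilde{\beta}_i]) (\tilde{\beta}_j - \mathrm{E}[\tilde{\beta}_j])^\top \right] \tag{56}$$

$$= \mathrm{E}\left[ \left( \mathbf{K} \left( \mathbf{X}^\top (\mathbf{U} + \mathbf{E}_1)_i + (\sqrt{\Lambda} \mathbf{E}_2)_i \right) - \mathbf{K} \mathbf{X}^\top \mathbf{U}_i \right) \left( \mathbf{K} \left( \mathbf{X}^\top (\mathbf{U} + \mathbf{E}_1)_j + (\sqrt{\Lambda} \mathbf{E}_2)_j \right) - \mathbf{K} \mathbf{X}^\top \mathbf{U}_j \right)^\top \right] \tag{57}$$

$$= \mathbf{K} \, \mathrm{E}\left[ \left( \mathbf{X}^\top (\mathbf{E}_1)_i + (\sqrt{\Lambda} \mathbf{E}_2)_i \right) \left( \mathbf{X}^\top (\mathbf{E}_1)_j + (\sqrt{\Lambda} \mathbf{E}_2)_j \right)^\top \right] \mathbf{K} \tag{58}$$

$$= \mathbf{K} \mathbf{X}^\top \mathrm{E}\left[ (\mathbf{E}_1)_i (\mathbf{E}_1)_j^\top \right] \mathbf{X} \mathbf{K}
+ \mathbf{K} \mathbf{X}^\top \mathrm{E}\left[ (\mathbf{E}_1)_i (\mathbf{E}_2)_j^\top \right] \sqrt{\Lambda}^\top \mathbf{K}
+ \mathbf{K} \sqrt{\Lambda} \, \mathrm{E}\left[ (\mathbf{E}_2)_i (\mathbf{E}_1)_j^\top \right] \mathbf{X} \mathbf{K}
+ \mathbf{K} \sqrt{\Lambda} \, \mathrm{E}\left[ (\mathbf{E}_2)_i (\mathbf{E}_2)_j^\top \right] \sqrt{\Lambda}^\top \mathbf{K}. \tag{59}$$

The expectations in the first and last terms of equation (59) give

$$\mathrm{E}\left[ (\mathbf{E}_1)_i (\mathbf{E}_1)_j^\top \right] = (\Lambda_e^{-1})_{i,j} \, \mathbf{I}_{N_e}, \tag{60}$$

$$\mathrm{E}\left[ (\mathbf{E}_2)_i (\mathbf{E}_2)_j^\top \right] = (\Lambda_e^{-1})_{i,j} \, \mathbf{I}_{F_e}, \tag{61}$$

where $\mathbf{I}_n$ is the $n$-dimensional identity matrix. The middle two terms are equal to zero because
$\mathrm{E}[(\mathbf{E}_2)_i (\mathbf{E}_1)_j^\top] = 0$ due to $\mathbf{E}_1$ and $\mathbf{E}_2$ being zero mean and independent of each
other. Thus, we get

$$\mathrm{cov}(\tilde{\beta}_i, \tilde{\beta}_j) = \mathbf{K} \mathbf{X}^\top (\Lambda_e^{-1})_{i,j} \mathbf{X} \mathbf{K} + \mathbf{K} \sqrt{\Lambda} (\Lambda_e^{-1})_{i,j} \sqrt{\Lambda}^\top \mathbf{K} \tag{62}$$

$$= (\Lambda_e^{-1})_{i,j} \, \mathbf{K} (\mathbf{X}^\top \mathbf{X} + \Lambda) \mathbf{K} \tag{63}$$

$$= (\Lambda_e^{-1})_{i,j} \, \mathbf{K} \tag{64}$$

$$= (\Lambda_e^{-1})_{i,j} \, (\mathbf{X}^\top \mathbf{X} + \Lambda)^{-1}. \tag{65}$$

This means the covariance matrix of $\mathrm{vec}(\tilde{\beta})$ is $\Lambda_e^{-1} \otimes (\mathbf{X}^\top \mathbf{X} + \Lambda)^{-1}$ and thus the
precision is $\Lambda_e \otimes (\mathbf{X}^\top \mathbf{X} + \Lambda)$. □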
Figure 7: The IC50 model with different latent dimensions.

A.6 Macau Performance with Different Latent Dimensions in ChEMBL 

Figure 7 shows the effect of the number of latent dimensions on the performance of Macau on the
model where a matrix of IC50 observations is factorized using entity features on drugs and
proteins.